Computational Comparison and Classification of Dialects
نویسندگان
چکیده
In this paper a range of methods for measuring the phonetic distance between dialectal variants are described. It concerns variants of the frequency method, the frequency per word method and Levenshtein distance, both simple (based on atomic characters) and complex (based on feature bundles). The measurements between feature bundles used Manhattan distance, Euclidean distance or (a measure using) Pearson’s correlation coefficient. Variants of these using feature weighting by entropy reduction were systematically compared, as was the representation of diphthongs (as one symbol or two). The dialects were compared with each other directly and indirectly via a standard dialect. The results of comparison were classified by clustering and by training of a Kohonen map. The results were compared to wellestablished scholarship in dialectology, yielding a calibration of the methods. These results indicate that the frequency per word method and the Levenshtein distance outperform the frequency method, that feature representations are more sensitive, that Manhattan distance and Euclidean distance are good measures of phonetic overlap of feature bundles, that weighting is not useful, that two-phone representations of diphthongs mostly outperform one-phone representations, and that dialects should be directly compared to each other. The results of clustering give the sharper classification, but the Kohonen map is a nice supplement.
منابع مشابه
A Computational Perspective on the Romanian Dialects
In this paper we conduct an initial study on the dialects of Romanian. We analyze the differences between Romanian and its dialects using the Swadesh list. We analyze the predictive power of the orthographic and phonetic features of the words, building a classification problem for dialect identification.
متن کاملComparison and Classification of Dialects
This project measures and classifies language variation. In contrast to earlier dialectology, we seek a comprehensive characterization of (potentially gradual) differences between dialects, rather than a geographic delineation of (discrete) features of individual words or pronunciations. More general characterizations of dialect differences then become available. We measure phonetic (un)related...
متن کاملThe Short Vowels /i/ and /u/ in Iranian Balochi Dialects
The aim of the present paper is to study the status of the short vowels /i/ and /u/ in five selected Iranian Balochi dialects. These dialects are spoken in Sistan (SI), Saravan (SA), Khash (KH), Iranshahr (IR), and Chabahar (CH) regions located in province Sistan va Baluchestan in the southeast of Iran. This study investigates whether these two vowels have the same qualities as the short /i/ an...
متن کاملAutomatic Kurdish Dialects Identification
Automatic dialect identification is a necessary Language Technology for processing multidialect languages in which the dialects are linguistically far from each other. Particularly, this becomes crucial where the dialects are mutually unintelligible. Therefore, to perform computational activities on these languages, the system needs to identify the dialect that is the subject of the process. Ku...
متن کاملA Novel Fault Detection and Classification Approach in Transmission Lines Based on Statistical Patterns
Symmetrical nature of mean of electrical signals during normal operating conditions is used in the fault detection task for dependable, robust, and simple fault detector implementation is presented in this work. Every fourth cycle of the instantaneous current signal, the mean is computed and carried into the next cycle to discover nonlinearities in the signal. A fault detection task is complete...
متن کامل